AD-A246  284 


Mesh  and  Torus  Chaotic  Routing 


Kevin  Bolding  and  Lawrence  Snyder 

Department  of  Computer  Science  &  Engineering 
University  of  Washington 
Seattle, WA 98195

Technical  Report  91-04-04 
January,  1992  (update) 




REPORT DOCUMENTATION PAGE

Report number: 91-04-04
Title: Mesh and Torus Chaotic Routing (updated version)
Type of report: Technical
Authors: Kevin Bolding and Lawrence Snyder
Contract or grant number: N00014-91-J-4041
Performing organization: Northwest Laboratory for Integrated Systems, University of Washington,
Dept. of Comp. Science, FR-35, Seattle, WA 98195
Controlling office: DARPA-ISTO, 1400 Wilson Boulevard
Report date: 1/16/92 (update)
Number of pages: 19
Monitoring agency: Office of Naval Research - ONR, Information Systems Program - Code 1513: CAP,
800 North Quincy Street, Arlington, VA 22217
Distribution statement: Distribution of this report is unlimited.
Key words: chaotic router, mesh network, torus network




Abstract 

The chaos router is an adaptive nonminimal message router for multicomputers that is
simple enough to compete with the fast, oblivious routers now in use in commercial machines.
It improves on previous adaptive routers by using randomization, which eliminates the need for
complex livelock protection and speeds the router.

The two-dimensional chaos router is shown to be theoretically sound and physically realizable.
Extensive simulation studies compare chaos routing with oblivious and deflection routing
in mesh and torus networks. Chaos routing is shown to be competitive for mesh networks and
superior for torus networks. This high performance is, perhaps, unexpected for the mesh since
there is no finite bound on the delivery time of any message.


1  Introduction 

Chaotic routing is a randomizing, adaptive, message routing technique that has previously been
shown (in simulations) to be effective for the binary n-cube (hypercube) topology [KS91]. The
technique is nonminimal, i.e. messages do not necessarily take minimal paths to their destinations,
and randomization plays a critical role in preventing livelock, i.e. in preventing messages from
continually circulating in the network without being delivered [KS90]. Though the principles apply
as well to batched message communication, chaotic routing assumes a continuous workload where
messages are presented at the nodes for injection into the network at random real (or fine-grain
discrete) times. Routing decisions are made locally in the routing nodes based on the destination
address stored in the headers of the messages and the availability of outgoing channels. Messages
can "cut through" nodes if an outgoing channel is immediately available, but they may also be
stored in the node if all outgoing channels are blocked, motivating short, e.g. 20 flit, messages.

Chaotic routing's success on the hypercube suggests that it might be effective for other topologies.
Networks of low dimension are important because the trend in parallel computer design is
towards mesh and torus based communication structures such as in the Intel Paragon, the Tera
computer, and the Caltech Mosaic. But applying "chaos" in two dimensions poses several problems.
First, chaotic routing relies on theoretical foundations, the theorems of which have only been
proved for hypercubes. This is easily remedied by the results in Appendix B. The second problem
is subtle and applies to any nonminimal adaptive router.

In a mesh with uniform random traffic, there is a "hot spot" in the center of the mesh. That
is, the shortest message paths between two points tend to cross the center of the mesh, causing the
resources in the center of the network (wires, buffers, etc.) to be more heavily used (see Figure
1). All adaptive routers will try to use these paths, but nonminimal adaptive routers, when they
encounter congestion, will try to "deroute" a message away from the congestion. In such cases the
hot spot can act as a barrier off of which messages can "bounce"; that is, the forward paths are all
congested, the message is derouted "backwards", and starts forward again, not having moved (or
been able to move) away from the congestion. Though all nonminimal adaptive routers are subject
to this type of behavior, the effect on performance varies depending on the type of router.

Figure 1: Average injection delay for a 256-node mesh.

Priority adaptive routers time stamp each message and routing is governed by the rule: oldest
message first. Thus, messages bouncing off the hot spot will eventually age enough to be routed
through it. Matters are not so certain for the chaos router, however. The primary advancement of
chaotic routers [KS90] is that they eliminate the time stamping and the time consuming prioritization,
replacing it with a reliance on randomization. But there is no mechanism that can assure the
delivery of a message in a fixed finite time. A message can continue bouncing off the hot spot for an
arbitrarily long period of time. Because of the probabilistic livelock freedom proved in Appendix
B, we know that the probability that the message has not been delivered in t steps goes to zero as t
increases. So we can be confident that the message will be delivered eventually. But it could take
a very, very long time, leaving us with the question: Does chaotic routing work for the mesh?

In this paper, besides proving the "necessary theorems" for two-dimensional chaotic routing,
we present simulation results comparing chaotic routers with oblivious routers and deflection, or
"hot potato," routers on the mesh and torus topologies of sizes 64, 256 and 1024 nodes for both
uniform random and hot spot loads. Three highlights are worth noting:

•  On the mesh, chaotic routing performs as well as oblivious and deflection routing in
throughput and latency for uniform traffic.

Thus, chaotic routing does work on a mesh, and in fact works about as well as other routers when
the traffic is uniform.

•  On the mesh, chaotic routing performs better than oblivious and deflection routing in
throughput and latency for nonuniform hot spot traffic.

Since it is likely that programs exhibit nonuniform traffic patterns, the performance in such cases
is perhaps more significant.

•  On the torus, chaotic routing is decidedly superior to oblivious and deflection routing in
throughput and latency.

The torus has better bisection bandwidth and better worst case path length than a mesh of similar
size at the cost of few extra wires. Its vertex transitive property aids the chaos router in giving
it significantly better performance. The chaotic torus router is the best two dimensional packet
router to our knowledge and thus a candidate for the next generation parallel computers.


2  Relationship  to  Previous  Research 

Borodin and Hopcroft [BH85] use the term oblivious to refer to routers for which the path of any
message is completely determined by its (source, destination) pair. They proved that oblivious
routers in an N node, d degree topology require $\Omega(\sqrt{N}/d^{3/2})$ steps to route some
permutations. The poor worst case performance and their fault intolerance would doom oblivious
routers for use in multicomputers were it not for the fact that they are extremely simple, and thus
fast. Accordingly, oblivious routers are the state-of-the-art for MIMD multicomputers such as those
built by Intel, Ametek and NCUBE. Dally, Seitz and Flaig introduce oblivious routers of the type
considered here for the mesh [Fla87] and torus [DS86] topologies.

Randomization was first applied in the context of message routing by Valiant and Brebner
[VB81], though in a way quite different from the chaotic approach. Their technique (select a
random intermediate destination for every message, route the message to that destination and then
on to the true destination) was applied to batched routing in a hypercube. It could obviously be
applied continuously [CS86] and in two-dimensional topologies. The main difficulty with this type
of randomization is that it doubles the expected path length of any message.
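The doubling of expected path length can be checked by brute force on a small network. The sketch below is only an illustration under stated assumptions: it uses a k x k torus (a vertex-transitive two-dimensional topology, not the hypercube of the original technique), uniformly random sources, destinations and intermediate nodes, and minimal-path distances. The names `torus_distance` and `expected_lengths` are hypothetical.

```python
from itertools import product

def torus_distance(a, b, k):
    """Minimal hop count between nodes a and b on a k x k torus (assumed topology)."""
    return sum(min((p - q) % k, (q - p) % k) for p, q in zip(a, b))

def expected_lengths(k):
    """Mean path length for direct routing and for routing via a random intermediate node."""
    nodes = list(product(range(k), repeat=2))
    n = len(nodes)
    # Direct routing: average distance over all (source, destination) pairs.
    direct = sum(torus_distance(s, d, k) for s in nodes for d in nodes) / n ** 2
    # Two-phase routing: source -> random intermediate -> destination.
    two_phase = sum(torus_distance(s, m, k) + torus_distance(m, d, k)
                    for s in nodes for m in nodes for d in nodes) / n ** 3
    return direct, two_phase

direct, two_phase = expected_lengths(4)
print(direct, two_phase)
```

Because the intermediate node is independent of both endpoints, each of the two phases has the same expected length as a direct route, which is where the factor of two comes from.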

An adaptive mesh router was proposed by Ngai and Seitz [NS89], but it differs from the chaotic
approach by using timestamps and prioritization to protect against livelock. Comparisons between
prioritized and chaotic routers have been performed [KS91]. Adaptive wormhole routing using
virtual channels has been studied by Duato [Dua91].

"Hot potato" or deflection routing is another scheme capable of adaptive routing [Smi81, Max89,
FS91, Smi89]. The approach is synchronous and the time step is long enough to transmit an entire
packet. At each step the incoming messages are paired with outgoing channels and are transmitted
in the next step. The pairing is done in a variety of ways: certain deterministic algorithms
attempt to maximize the number of messages sent out productive channels, while others use a
greedy algorithm with random selection. Those messages not receiving a productive channel are
"deflected," i.e. derouted, out any available channel. Deflection routing differs from chaotic routing
in several ways. Chaotic routing is not batched, i.e. does not require all headers to be present at
once, thus permitting fast self-timed or high clock rate implementations, and better utilization
of channels since messages can cut through, i.e. messages can be "in" multiple routers at once.
Chaotic routing permits messages to wait for forward traffic to clear, thus reducing time consuming
deroutes which necessarily delay the packet at least two "message times". Pausing for traffic to clear
cushions the effects of bursts. Finally, chaotic routing resorts to derouting only under conditions
of high load, when slower performance is inevitable.

3  Chaos  Router  Design 

The chaotic router studied here is a two dimensional variant of the hypercube chaotic router
[KS91]. The basic design of the chaos router is similar to a typical oblivious virtual cut-through
router, with input and output frames connected by a crossbar switch, and hardware to increment or
decrement the headers of messages as they pass through (see Figure 2). Two primary distinctions
exist, though. The first is that the routing relation no longer specifies a single channel to traverse
next, but instead a set of equally profitable channels. The first available profitable channel will
be chosen for routing. The second distinction is the addition of a small (5 message) buffer, the
MultiQueue, which holds messages for which no profitable channels are immediately available. Since
the buffer space is off of the critical resource path, messages in the queue do not block messages
behind them. Messages enter the queue along a separate crossbar whenever they have been denied
access to any profitable channel long enough for the entire message body to have arrived in the
input frame. Also, in order to prevent deadlock, when a message is read from the queue into the
output frame for channel i and the input frame for channel i is full, the message in the input frame
is immediately read into the queue. Messages cannot enter the queue from the injection frame, and
messages which are awaiting the availability of the ejection buffer do not enter the queue, as they
will be consumed by the processor. Whenever an output channel that is profitable for a message in
the queue becomes available, the first message in the queue which can use that channel is sent from
the queue through another crossbar to the output frame for that channel. When several messages
can profitably use a channel at the same time, priority is given to messages in the queue (in FIFO
order); among competing input frames messages are chosen randomly.

A critical situation occurs when a message is specified to be sent to the queue, but the queue is
completely full. In such a situation, a message is randomly selected from the queue to be derouted
along the first available channel so that room will be created in the queue for the newly arriving
message. Derouting provides an additional factor of adaptivity to the chaos router and allows the
use of a packet-exchange protocol for deadlock prevention [NS89].
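The routing decision described in this section can be sketched in a few lines. This is a simplified illustration and not the authors' hardware design: it tracks only message destinations on a 2D mesh, leaves the queue capacity as a parameter, uses a fixed channel order to stand in for "first available", and does not model timing, ejection, or the packet-exchange protocol. The names `ChaosNodeSketch`, `accept` and `channel_freed` are hypothetical.

```python
import random
from collections import deque

def profitable_channels(pos, dst):
    """Channels that bring a message one hop closer to dst on a 2D mesh."""
    (x, y), (dx, dy) = pos, dst
    channels = []
    if dx > x:
        channels.append("+x")
    if dx < x:
        channels.append("-x")
    if dy > y:
        channels.append("+y")
    if dy < y:
        channels.append("-y")
    return channels

class ChaosNodeSketch:
    """Toy model of one routing node; messages are represented by their destinations."""

    def __init__(self, pos, queue_capacity, seed=0):
        self.pos = pos
        self.queue_capacity = queue_capacity   # assumed parameter
        self.queue = deque()                   # multiqueue contents, oldest first
        self.rng = random.Random(seed)

    def accept(self, dst, free):
        """Handle a message waiting in an input frame, given the set of free channels."""
        for ch in profitable_channels(self.pos, dst):
            if ch in free:
                return [("send", dst, ch)]     # first available profitable channel
        actions = []
        if len(self.queue) >= self.queue_capacity:
            if not free:
                return [("wait", dst, None)]   # simplification: nothing can move yet
            # Queue full: deroute a randomly chosen queued message to make room.
            victim = self.rng.randrange(len(self.queue))
            victim_dst = self.queue[victim]
            del self.queue[victim]
            actions.append(("deroute", victim_dst, sorted(free)[0]))
        self.queue.append(dst)
        actions.append(("queue", dst, None))
        return actions

    def channel_freed(self, ch):
        """When channel ch becomes available, send the oldest queued message that can use it."""
        for i, dst in enumerate(list(self.queue)):
            if ch in profitable_channels(self.pos, dst):
                del self.queue[i]
                return ("send", dst, ch)
        return None
```

The random choice of the derouted message is the only place randomization enters this sketch; no timestamps or priorities are kept.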

4  The  Network  Model 

The performance of different routing schemes varies greatly according to the model of the network
being studied. For the studies of chaos routing, we use the following network model:

The network is a regular two-dimensional network of bi-connected nodes. Between each pair of
adjacent nodes in the network there is a channel consisting of control wires and a single data bus.
The data bus is shared between the two directions, with arbitration occurring between message
boundaries. The messages are fixed-size packets consisting of a header and several data words.
The width of the data bus determines the size of a flit, which is the amount of data which can be
transmitted over the data bus in one cycle. We parameterize the packet size in our studies in terms
of the number of flits per packet, L. Thus, for a 16 bit wide bus, a 20-flit message would contain
a 16-bit header and 304 bits of data. We constrain our experiments to messages of size 20 flits,
which is consistent with existing multicomputer designs.
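The packet arithmetic above is small enough to check directly. The helper below is a hypothetical illustration that assumes the header occupies exactly one flit, as in the 16-bit example.

```python
def packet_bits(flits_per_packet, bus_width_bits, header_flits=1):
    """Return (header_bits, data_bits) for a fixed-size packet.

    Assumes the header occupies a whole number of flits (one by default).
    """
    total_bits = flits_per_packet * bus_width_bits
    header_bits = header_flits * bus_width_bits
    return header_bits, total_bits - header_bits

header, data = packet_bits(flits_per_packet=20, bus_width_bits=16)
print(header, data)
```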

We study the two-dimensional mesh and the two-dimensional torus in this investigation. To
judge changes in performance with network size, we compare networks of 64, 256, and 1024 nodes.

5  Routers  Studied 

We study three routers in this paper: an oblivious router, the chaos router, and a deflection
router. Most current multicomputers use some variant of oblivious routing. We chose a virtual
cut-through oblivious router with input and output queueing to provide a baseline for current routing
techniques. We provide results from a deflection router based on [FS91] to provide another baseline
for comparison. Finally, we study the chaos router as presented in Section 3.

5.1  Oblivious  Router 

The oblivious router studied here is based upon the Kermani and Kleinrock [KK79] virtual
cut-through router. Specifically, the router consists of a set of input and output frames and a crossbar
switch which connects each input frame to every output frame. Each channel has one input frame
and one output frame, each capable of holding exactly one fixed-size message.* The injection and
delivery channels also have an input frame and an output frame, respectively, which are connected
to the crossbar as well. Operation of the router proceeds in virtual cut-through fashion: whenever
a message arrives in an input frame, it is immediately routed to the output frame for the next
channel on its path to its destination,† if that output frame is available. It is not necessary to
receive the entire message in an input frame before the header is sent to the output frame. If the
output frame is not immediately available, the message will wait in the input frame until it becomes
available, blocking any messages behind it if necessary. Operation of the channels proceeds in a
similar demand-driven fashion.
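As the footnote to this section notes, an oblivious router has exactly one output channel per step and traverses channels in order of dimension. A minimal sketch of that rule for a mesh follows; it assumes nodes are addressed by (x, y) coordinates, corrects x before y, and ignores torus wraparound and virtual channels. The function names are hypothetical.

```python
def dimension_order_next_hop(cur, dst):
    """Next node on the dimension-order path from cur to dst, or None if already there."""
    x, y = cur
    dx, dy = dst
    if x != dx:                      # finish the x dimension first
        return (x + (1 if dx > x else -1), y)
    if y != dy:                      # then the y dimension
        return (x, y + (1 if dy > y else -1))
    return None

def dimension_order_path(src, dst):
    """Full path; it depends only on the (source, destination) pair, i.e. it is oblivious."""
    path = [src]
    while path[-1] != dst:
        path.append(dimension_order_next_hop(path[-1], dst))
    return path

print(dimension_order_path((0, 0), (2, 1)))
```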

* For the oblivious torus router, virtual channels are implemented by giving each physical channel
two input and output frames.

† Since this is an oblivious router, there will be only one possible output channel at each routing
step. In order to prevent deadlock, the channels must be traversed in order of dimension.

5.2  Deflection  Router

Deflection routing is an adaptive routing scheme in which messages arriving at a node are guaranteed
to leave the node in the next routing cycle. An attempt is made to assign each message to a channel
which reduces its distance to its destination, giving preference to messages with only a single
profitable direction; any remaining messages are then randomly assigned to the remaining free
outgoing channels. The scheme does not quite fit the network model presented in Section 4,
as channels must always be available and, thus, cannot be shared. We compensate for this by
dividing each deflection routing channel into two uni-directional channels of one half the width
of those of the chaos and oblivious routers. Also, the deflection protocol requires that all the headers
of incoming messages arrive at the same moment, which is generally accomplished in their network
models by using very high bandwidth channels capable of transmitting an entire message in each
flit. Since our model includes multiple-flit messages, a routing decision may occur only once an
entire message arrives at a node, resulting in a store-and-forward technique without virtual
cut-through. Finally, in the analytical model presented in [FS91], all messages which arrive at a single
destination node are removed from the network at once. In our simulations, we limit the delivery
capability to the bandwidth of a standard network channel (one flit per cycle), as would be required
in a realistic implementation.
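The channel assignment step of deflection routing can be sketched as follows. This is an illustrative reading of the description above, not the algorithm of [FS91]: it assumes an interior mesh node with four outgoing channels, at most one incoming message per channel, and no message already at its destination. The name `deflection_assign` is hypothetical.

```python
import random

def profitable(cur, dst):
    """Directions that reduce the distance from cur to dst on a 2D mesh."""
    (x, y), (dx, dy) = cur, dst
    dirs = []
    if dx > x:
        dirs.append("+x")
    if dx < x:
        dirs.append("-x")
    if dy > y:
        dirs.append("+y")
    if dy < y:
        dirs.append("-y")
    return dirs

def deflection_assign(cur, dsts, channels=("+x", "-x", "+y", "-y"), seed=0):
    """Map each message index to an outgoing channel; every message leaves the node."""
    rng = random.Random(seed)
    free = list(channels)
    assignment = {}
    deflected = []
    # Messages with a single profitable direction are served first.
    order = sorted(range(len(dsts)), key=lambda i: len(profitable(cur, dsts[i])))
    for i in order:
        choices = [c for c in profitable(cur, dsts[i]) if c in free]
        if choices:
            assignment[i] = choices[0]
            free.remove(choices[0])
        else:
            deflected.append(i)
    # Remaining messages are deflected onto randomly chosen free channels.
    rng.shuffle(free)
    for i, ch in zip(deflected, free):
        assignment[i] = ch
    return assignment
```

For example, with messages bound for (3, 3), (3, 1) and (2, 1) at node (1, 1), the two messages that can only use +x compete for it, and the loser is deflected onto a channel that does not reduce its distance.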

6  Traffic  Models

In order to compare the relative performance of the different routing schemes, a synthetic workload
is applied to the simulated network and performance measurements are taken. The choice of the
workload is critical when trying to compare the schemes. We provide simulation results for two
workloads: uniform random and hot spot traffic.

6.1  Uniform  Random  Traffic 

For uniform random traffic, each node presents a message to the network with a destination chosen
uniformly randomly from each of the nodes in the network. The time between the presentation of
messages is chosen randomly with a mean time based on the simulated applied load. The load is
presented as a fraction of the maximum load the network could handle if there were no resource
conflicts. This is computed as the point at which the utilization of channels cut by a bisection of
the network reaches 100% assuming each message crosses this bisection with probability 0.5. If all
channels of the network were utilized 100% of the time and all messages traveled on the shortest
paths available, this maximum throughput would be obtained under uniform random traffic. The
maximum applied load is then computed as the minimum inter-injection period for each network.

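For the network model presented in Section 4, where one flit can be transmitted across a channel
in one cycle, the minimum inter-injection period per node is $\frac{\sqrt{N}}{2} L$ cycles for N-node
meshes and $\frac{\sqrt{N}}{4} L$ cycles for N-node tori with messages of length L flits (a bisection
of a square mesh cuts $\sqrt{N}$ channels, and a bisection of a square torus cuts $2\sqrt{N}$).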
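The bisection argument can be turned into a short calculation. The sketch below assumes a square network of side sqrt(N), one shared data bus per channel carrying one flit per cycle, sqrt(N) channels cut by a mesh bisection and 2 sqrt(N) by a torus bisection, and each message crossing the bisection with probability 0.5. The function name is hypothetical.

```python
import math

def min_injection_period(n_nodes, flits_per_msg, topology):
    """Minimum cycles between injections per node at 100% bisection utilization."""
    k = math.isqrt(n_nodes)
    if k * k != n_nodes:
        raise ValueError("sketch assumes a square k x k network")
    bisection_channels = {"mesh": k, "torus": 2 * k}[topology]
    # With period T, the N nodes inject N/T messages per cycle; half of them cross
    # the bisection, each occupying a cut channel for L cycles:
    #     (N / 2) * L / T <= bisection_channels
    return n_nodes * flits_per_msg / (2 * bisection_channels)

for topology in ("mesh", "torus"):
    print(topology, min_injection_period(256, 20, topology))
```

For a 256-node network with 20-flit messages this gives 160 cycles on the mesh and 80 cycles on the torus under these assumptions.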

6.2  Hot  Spot  Traffic 

Although uniform random traffic is a natural model of network traffic, many applications used on
multicomputers create message traffic which has several hot spot nodes that receive considerably
more traffic than the rest of the network. We attempt to model an abstract system by a synthetic
load consisting of the same injection load as uniform random traffic, but with the destination
distribution skewed in the following manner: ten "hot" nodes are chosen at the beginning of the
simulation, each being four times as likely as the other N-10 nodes to be the destination of a
message. Thus, these nodes become hot spots which could represent nodes that are used for
synchronization or locking in a multicomputer application. The total loading of the network is the
same as for uniform random traffic.
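A destination generator for this skewed distribution might look like the sketch below. It is an assumption-laden illustration: the hot nodes are picked uniformly at random, a node may draw itself as a destination, and the names `make_hot_spot_sampler` and `sample` are hypothetical.

```python
import random

def make_hot_spot_sampler(n_nodes, n_hot=10, weight=4, seed=0):
    """Return (hot_nodes, sample) where sample() draws one destination node."""
    rng = random.Random(seed)
    # Hot nodes are fixed once, at the beginning of the simulation.
    hot = set(rng.sample(range(n_nodes), n_hot))
    # Each hot node is `weight` times as likely as any other node.
    weights = [weight if v in hot else 1 for v in range(n_nodes)]

    def sample():
        return rng.choices(range(n_nodes), weights=weights, k=1)[0]

    return hot, sample

hot, sample = make_hot_spot_sampler(256)
draws = [sample() for _ in range(20000)]
print(sum(d in hot for d in draws) / len(draws))
```

With 256 nodes, the ten hot nodes together receive 40/286 of all messages in expectation, roughly 14%, compared with about 4% under uniform traffic.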


7  Simulations 

Simulations for the networks and routers studied were conducted using a flit-based simulator written
in C. The simulations were based on the cycle time unit. One cycle is the time necessary to transmit
a single flit across a channel. Routing decisions can be made in a single cycle. Thus, if a message
header  enters  a  router  at  cycle  t,  it  may  enter  the  next  router  as  early  as  cycle  t  -f  P. 

Simulations were run by applying the simulated load to the network in a continuous manner.
Statistics were computed in intervals in which each node of the network has injected at least 50
messages. Average throughput and average latency were computed for each statistics interval and
convergence was determined when the standard deviations of both measures over the most
recent five intervals were less than 3%. The results presented here represent the averages and
standard deviations of 3 to 5 runs.
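The convergence rule can be sketched as a check over per-interval averages. The interpretation here is an assumption: the standard deviation of the last five interval values is taken relative to their mean, the threshold is left as a parameter, and the rule must hold for throughput and latency alike. The function names are hypothetical.

```python
import statistics

def measure_converged(interval_values, window=5, tolerance=0.03):
    """True if the relative standard deviation over the last `window` intervals is small."""
    if len(interval_values) < window:
        return False
    recent = interval_values[-window:]
    mean = statistics.mean(recent)
    if mean == 0:
        return False
    return statistics.stdev(recent) / abs(mean) < tolerance

def run_converged(throughputs, latencies, window=5, tolerance=0.03):
    """Both measures must satisfy the criterion before the run is stopped."""
    return (measure_converged(throughputs, window, tolerance)
            and measure_converged(latencies, window, tolerance))
```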

The statistics reported here are the average throughput of the network normalized to the
maximum throughput under uniform random load and the average latency of messages in cycles. We
define latency as the time from presentation of a message to the network until the message has been
completely removed from the network at its destination (notice that this does not include source
queueing time).

8  Simulation  Results 

As described earlier, simulations were performed on mesh- and torus-connected networks of 64, 256,
and 1024 nodes using random traffic and "hot spot" traffic for each of the three routing schemes
studied. The average throughputs and average latencies are reported here.

To gauge performance, we concentrate on the high-load throughput and medium-load latencies.
For low loads, all routing schemes are able to deliver the entire applied load without difficulty. The
point at which the network saturates and the network is not able to keep up with the applied load
is the interesting point in this case. Also, the shape of the throughput curve above saturation is
important, i.e. does throughput ever decrease with increasing applied load? Latency is a more
critical issue during lower load periods. At loads above saturation, since the network cannot keep
up with the load applied, the latency of messages which do get through becomes of only peripheral
interest. However, when the network is operating below saturation, it is latency that is the critical
figure of merit. Thus, we will consider throughput saturation points and below-saturation latencies
as the figures of merit for the networks studied.

We present full throughput and latency curves for 256-node networks with chaos and oblivious
routing. We graph only throughput data for deflection routing because the store-and-forward nature
of the router results in especially high latency figures. Since the shapes of the curves do not differ
appreciably over different network sizes, we present only the 100% load throughput and 50% load
latency points for other size networks. The raw data is given for all networks in Appendix A.

8.1  Mesh  networks 

For mesh networks under uniform random traffic, all three routing schemes give similar throughput
results (Figures 3 and 4). The throughput reaches 80-90% of the maximum throughput in each
case and there is no decline in throughput as the load approaches 100%. The latencies for the
oblivious and chaos routers remain very close throughout all load ranges, with the chaos router giving
slightly better values. The performance of the adaptive schemes is actually lower than would be
expected: the additional hardware gives little or no benefit under random loads. This is the result
of the large hot spot inherently present in the center of mesh-connected networks (Figure 1)
which creates a substantial barrier to cross-network traffic. While the oblivious router sends
messages straight through the hot spot, even if slowly, the adaptive routers attempt to route messages
around the congested center. However, since the area is so large, messages tend to bounce around
the periphery for long periods of time, resulting in very long paths from source to destination.

Figure 3: 256-node mesh results (panel title: 256-node Mesh Latency, Hotspot Traffic)

Figure 4: Mesh throughput (100% applied load) and latency (50% applied load) vs. network size
(panel title: Latency, Hotspot Traffic; axes: cycles, normalized throughput, network size (log 2))

Footnote to Section 7: For the deflection router, since the entire message must be received before
transmission can begin, a message entering a router at cycle t will not leave until cycle t + L.
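The central hot spot of the mesh can be made visible by counting how many source/destination paths cross each channel. The sketch below is an illustration only: it uses x-then-y dimension-order paths as a stand-in for shortest paths, counts every ordered pair of nodes once, and treats the two directions of a channel as one shared bus. The name `channel_loads` is hypothetical.

```python
from collections import Counter
from itertools import product

def channel_loads(k):
    """Number of dimension-order paths crossing each channel of a k x k mesh."""
    loads = Counter()
    nodes = list(product(range(k), repeat=2))
    for src, dst in product(nodes, repeat=2):
        x, y = src
        while x != dst[0]:                       # travel along x first
            nx = x + (1 if dst[0] > x else -1)
            loads[frozenset(((x, y), (nx, y)))] += 1
            x = nx
        while y != dst[1]:                       # then along y
            ny = y + (1 if dst[1] > y else -1)
            loads[frozenset(((x, y), (x, ny)))] += 1
            y = ny
    return loads

loads = channel_loads(8)
center = loads[frozenset(((3, 0), (4, 0)))]      # channel crossing the middle of a row
edge = loads[frozenset(((0, 0), (1, 0)))]        # channel at the end of the same row
print(center, edge)
```

On an 8 x 8 mesh the middle channel of a row carries more than twice as many paths as the outermost one, which is consistent with the heavier use of central resources described above.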

When hot spots are added to the mesh, chaos routing becomes distinctly better than oblivious
and deflection routing for small networks, with the benefit declining as network size increases
(Figures 3 and 4).* This can be seen as the oblivious throughput increases with network size while
the chaos throughput remains relatively stable. For small networks the oblivious throughput is
especially low, resulting in high latencies from the additional congestion. This behavior is due
to the fact that the central hot spot presents a more formidable barrier in larger networks: for
small networks the ten hot spots influence the traffic greatly and the chaos router performs better,
but as the network grows, the central hot spot dominates the traffic flow and the oblivious router
performance improves. Thus, for smaller network sizes, the adaptivity of the chaos router proves
useful in the mesh, but this advantage diminishes with increasing network size.

* Data is not currently available for the 64-node deflection router.

8.2  Torus  networks 

'^or  torus-con:  ectea  networks,  the  chao.^  router  .  erforms  significantly  better  thba  both  the 
oblivious  and  deflection  routers  in  all  respects  (Figures  o  and  6).  Since  a  torus  is  vertex-transitive, 
he.  the  network  appears  the  same  to  every  node,  traffic  is  uniformly  distributed  throughout  the 
network,  unlike  the  mesh.  This  translates  into  a  performance  advantage  for  the  cnaos  router, 
which  allows  messages  to  use  the  entire  net  ^'ork  without  the  constraints  of  oblivious  dimension- 
order  routing.  The  chaos  router  achieve'  near-maximum  throughput  under  random  tt..ffic  for  all 
network  sizes  considered,  while  the  oblivious  and  d, Section  routers  top  out  at  55-70VX  pertormance. 
Again,  the  latencies  remain  low  for  low  .0  mediu:..  leads,  indicating  very  superior  performance. 

A disturbing property of the torus oblivious router is that the maximum throughput is achieved
at less than the maximum load. This is due to a disturbance of the vertex-transitivity of the
network introduced by the addition of deadlock prevention. Since the virtual-channel deadlock
prevention scheme applied [DS87] distinguishes certain nodes as "special" in order to break cycles,
the uniformity of the network is broken and hot spots are introduced at high loads. This results in
the degradation of throughput as load is increased. The chaos and deflection routers preserve the
uniformity of the network and do not exhibit this behavior.

For hot spot traffic, the adaptive routers perform well and the oblivious router suffers an earlier
leveling of throughput than with random traffic. The advantage of chaos routing is clearly apparent
here, as throughput and latency are only marginally affected by the non-uniform traffic load. Overall,
the chaos router is clearly superior to the oblivious and deflection routers for the torus network.

9  Conclusions 

We have presented a two-dimensional variant of the hypercube chaos router and shown it to be a
viable router. The theoretical foundations of the two-dimensional router have been presented. A
working design of the chaos router has been given which is capable of competing with oblivious
routers for critical-path complexity. The performance of the chaos router is comparable to oblivious
routers for meshes with random traffic and better with hot spot traffic. For torus networks, the
chaos router performs much better than the oblivious and deflection routers.

Figure 5: 256-node torus results

Appendix A: Numerical Results

Data for 64-, 256-, and 1024-node mesh and torus networks with uniform random and hot spot traffic. Statistics
presented are the means and standard deviations for normalized throughput and latency over three runs using
different random number seeds.

Appendix B: Theoretical Considerations

It is necessary to show that chaotic routers are both deadlock free and livelock free. Deterministic
deadlock freedom is straightforward for routers that use a message exchange protocol [XS89]. The
case analysis for the two-dimensional case matches the hypercube case [KS90].

Deterministic livelock freedom, that every message is delivered after a given period of time,
is not true for chaotic routers. However, probabilistic livelock freedom - the probability that a
message remains undelivered after t seconds goes to zero as t increases - is true. The following
sketch of the proof mirrors the hypercube argument [KS90].

The message's path through the torus network is described by a sequence of moves. The distance
of a message from its destination is the Manhattan distance, which can be at most $\sqrt{N} - 1$. Clearly,
for $\sqrt{N}$ even, every move either increases or decreases the message's distance to the destination.
The probability of moving closer is the probability of being routed, $p > \epsilon$, which is established in a
theorem arguing that a message remains in the multiqueue a bounded amount of time and is thus
subjected to only a bounded number of random derouting decisions (see [KS90]). The probability
of moving further is $q = 1 - p$.
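
The claim that every move changes the distance when $\sqrt{N}$ is even can be checked exhaustively for a small torus. The following is a minimal sketch, not taken from the router design: the function names and the side lengths 8 and 7 are assumptions chosen for illustration, and the side length `k` stands for $\sqrt{N}$.

```python
def ring_distance(a, b, k):
    """Shortest distance between positions a and b on a ring of k nodes."""
    d = abs(a - b) % k
    return min(d, k - d)


def torus_distance(src, dst, k):
    """Manhattan distance with wraparound on a k x k torus."""
    return ring_distance(src[0], dst[0], k) + ring_distance(src[1], dst[1], k)


def distance_changes(k):
    """Set of distance changes produced by any single hop on a k x k torus."""
    dst = (0, 0)  # the torus is vertex-transitive, so one destination suffices
    changes = set()
    for x in range(k):
        for y in range(k):
            here = torus_distance((x, y), dst, k)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nxt = ((x + dx) % k, (y + dy) % k)
                changes.add(torus_distance(nxt, dst, k) - here)
    return changes


print(sorted(distance_changes(8)))  # even side length
print(sorted(distance_changes(7)))  # odd side length
```

For the even side length only the changes -1 and +1 occur. For the odd side length a change of 0 also appears, because a hop across the midpoint of a ring leaves the ring distance unchanged, which is why the argument is stated for $\sqrt{N}$ even.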

Let us define a game as a sequence of $\sqrt{N}$ moves. Message $M$ starts game $i$ at distance $s_i$ and
finishes at distance $s_{i+1}$. Let $L_i$ denote the event that $M$ was not delivered during game $i$ and $W_i$
the event that $M$ was delivered during game $i$.

Let $Q(i)$ be the probability that message $M$ has not been delivered after $i$ games. Then

$$Q(i) = P(L_i L_{i-1} \cdots L_1) = P(L_i \mid L_{i-1} \cdots L_1) \cdot P(L_{i-1} \cdots L_1).$$

For simplicity, let us substitute $F_k$ for $L_k \cdots L_2 L_1$, $1 \le k \le i$, and let us define
$P(L_1 \mid F_0) = P(L_1)$ and $P(W_1 \mid F_0) = P(W_1)$. Then

$$Q(i) = P(L_i \mid F_{i-1}) \cdot P(L_{i-1} \mid F_{i-2}) \cdots P(L_1) \qquad (1)$$

Clearly, $P(L_j \mid F_{j-1}) = 1 - P(W_j \mid F_{j-1})$, $1 \le j \le i$. In the following we will estimate
$P(W_j \mid F_{j-1})$. Let $S_{j,k}$ denote the event that message $M$ starts game $j$ at Manhattan distance $k$
from its destination. Events $S_{j,k}$ are mutually exclusive and one of them necessarily happens. Thus

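$$P(W_j \mid F_{j-1}) = P(W_j S_{j,1} \cup \cdots \cup W_j S_{j,\sqrt{N}} \mid F_{j-1}) \;\Rightarrow$$

$$P(W_j \mid F_{j-1}) = \sum_{k=1}^{\sqrt{N}} P(W_j S_{j,k} \mid F_{j-1}) \;\Rightarrow$$

$$P(W_j \mid F_{j-1}) = \sum_{k=1}^{\sqrt{N}} P(W_j \mid S_{j,k} F_{j-1}) \cdot P(S_{j,k} \mid F_{j-1}).$$

But $P(W_j \mid S_{j,k} F_{j-1}) > \epsilon^{\sqrt{N}}$, thus

$$P(W_j \mid F_{j-1}) > \epsilon^{\sqrt{N}} \sum_{k=1}^{\sqrt{N}} P(S_{j,k} \mid F_{j-1}).$$

Since $\sum_{k=1}^{\sqrt{N}} P(S_{j,k} \mid F_{j-1}) = 1$,

$$P(W_j \mid F_{j-1}) > \epsilon^{\sqrt{N}} \;\Rightarrow\; P(L_j \mid F_{j-1}) < 1 - \epsilon^{\sqrt{N}} \qquad (2)$$

Finally, (1) and (2) give $Q(i) < (1 - \epsilon^{\sqrt{N}})^i$.
Thus the probability $Q(i)$ that $M$ will not have been delivered after $i$ games, as $i \to \infty$, is:

$$\lim_{i \to \infty} Q(i) \le \lim_{i \to \infty} (1 - \epsilon^{\sqrt{N}})^i = 0.$$

The probability $P(i)$ that $M$ will be delivered after $i$ games, as $i \to \infty$, is:

$$\lim_{i \to \infty} P(i) = 1.$$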
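
To give a feel for how quickly the bound $(1 - \epsilon^{\sqrt{N}})^i$ decays, the sketch below evaluates it numerically. The values of $\epsilon$, the side length, and the target probability are hypothetical, chosen only for illustration; the proof establishes that some $\epsilon > 0$ exists, not its value. The function names are likewise assumptions.

```python
import math


def undelivered_bound(eps, side, games):
    """Upper bound (1 - eps**side)**games on Q(games), where side = sqrt(N)."""
    return (1.0 - eps ** side) ** games


def games_needed(eps, side, delta):
    """Smallest number of games i for which the bound on Q(i) is at most delta."""
    per_game = eps ** side  # lower bound on the chance of delivery in one game
    return math.ceil(math.log(delta) / math.log1p(-per_game))


# Hypothetical values: eps = 0.5 on a 4 x 4 torus, target bound of 1%.
i = games_needed(0.5, 4, 0.01)
print(i, undelivered_bound(0.5, 4, i))
```

Because the per-game delivery bound is $\epsilon^{\sqrt{N}}$, the number of games required by this bound grows rapidly with $\sqrt{N}$. The bound is a worst-case guarantee of eventual delivery and should not be read as a prediction of typical latency.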


The essential feature of the proof is the condition that $p > \epsilon$. Since this condition holds for
meshes on the edges (for the available edges), the theorem has the mesh topology as a corollary.